From Paraphrase Database to Compositional Paraphrase Model and Back

Words

congruent: coinciding exactly; identical in shape and size

Abstract

The Paraphrase Database (PPDB) consists of a list of phrase pairs with heuristic confidence estimates, and its coverage is necessarily incomplete.

They propose models to:

  1. score paraphrase pairs more accurately than PPDB’s internal scores
  2. improve its coverage

They also introduce two new, manually annotated datasets to evaluate short-phrase paraphrasing models:

  1. Annotated-PPDB
  2. ML-Paraphrase

Introduction

Paraphrase detection: the task of analyzing two segments of text and determining whether they have the same meaning despite differences in structure and wording.

Drawbacks of PPDB:

  1. lack of coverage
  2. PPDB is a nonparametric paraphrase model: the number of parameters (phrase pairs) grows with the size of the dataset used to build it.

What this work does:

  1. They show that initial skip-gram word vectors can be fine-tuned for the paraphrase task by training on word pairs from PPDB, producing the PARAGRAM word vectors.
  2. They show that the resulting word and phrase representations are effective on a wide variety of tasks.

Contributions:

  1. Provide new PARAGRAM word vectors, which improve performance in sentiment analysis and achieve state of the art on SimLex-999
  2. Provide ways to use PPDB to embed phrases
  3. Introduce two new datasets

New Paraphrase Datasets

Annotated-PPDB

Most existing paraphrase datasets focus either on words, like SimLex-999, or on entire sentences, such as the Microsoft Research Paraphrase Corpus; Annotated-PPDB instead targets short phrases. It was built in four steps:

  1. filter phrases for quality
  2. filter by lexical overlap
  3. select a range of paraphrasabilities
  4. Annotate with Mechanical Turk

Finally, they selected 1260 phrase pairs from the 3000 annotations. These 1260 examples were then randomly split into a development set of 260 examples and a test set of 1000 examples.

ML-Paraphrase

The second newly annotated dataset is based on the bigram similarity task. They found that the original annotations were not consistent with the notion of similarity central to paraphrase tasks. For instance, "television set" and "television programme" were among the highest-rated pairs in the NN section, and "older man" and "elderly woman" were among the highest-ranked JN pairs.

Paraphrase Models

The goal is to embed phrases into a low-dimensional space such that cosine similarity in the space corresponds to the strength of the paraphrase relationship between phrases. A recursive neural network (RNN) is used.

For a phrase $p$, they compute its vector $g(p)$ through recursive computation on the parse tree. If $p$ is a parent node with child nodes $c_1$ and $c_2$:


$g(p)=f(W[g(c_1);g(c_2)]+b)$

Here $W$ is the composition weight matrix, not the word embedding matrix.
If $p$ is a leaf (it has no child nodes), its vector is simply its word embedding:


$g(p)=W_w^{(p)}$
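A minimal sketch of this recursive composition, assuming a toy binary parse encoded as nested tuples and $f=\tanh$ (all names and dimensions here are illustrative, not from the paper's code):

```python
import numpy as np

def compose(node, W, b, word_vecs):
    """Recursively compute g(p) over a binary parse tree.

    Leaves are word strings looked up in word_vecs (g(p) = W_w^{(p)});
    internal nodes are (left, right) tuples combined as
    g(p) = f(W[g(c1); g(c2)] + b), with f = tanh here.
    """
    if isinstance(node, str):                 # leaf: return its embedding
        return word_vecs[node]
    g1 = compose(node[0], W, b, word_vecs)
    g2 = compose(node[1], W, b, word_vecs)
    return np.tanh(W @ np.concatenate([g1, g2]) + b)

# toy example with 3-dimensional embeddings
rng = np.random.default_rng(0)
word_vecs = {w: rng.standard_normal(3) for w in ["older", "man"]}
W = rng.standard_normal((3, 6))               # maps [g(c1); g(c2)] back to 3 dims
b = np.zeros(3)
g = compose(("older", "man"), W, b, word_vecs)
```

Note that $W$ maps the concatenated child vectors (dimension $2n$) back to dimension $n$, so the same matrix can be reused at every internal node of the tree.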

The objective function follows:


$$\min_{W,b,W_w}\frac{1}{|X|}\Big(\sum_{\langle x_1,x_2\rangle\in X} \max(0,\delta-g(x_1) \cdot g(x_2) + g(x_1) \cdot g(t_1)) \\ + \max(0,\delta-g(x_1) \cdot g(x_2) + g(x_2) \cdot g(t_2))\Big) \\ + \lambda_W(||W||^2+||b||^2)+\lambda_{W_w}||W_{w_{\text{initial}}}-W_w||^2$$

where $\delta$ is the margin (set to 1 in all of the experiments), and $t_1$ and $t_2$ are carefully-selected negative examples taken from a mini-batch during optimization.

The intuition for this objective is that we want the two phrases to be more similar to each other ($g(x_1)\cdot g(x_2)$) than either is to their respective negative examples $t_1$ and $t_2$, by a margin of at least $\delta$.
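The two hinge terms for a single pair can be sketched as follows, using the plain dot products from the objective (the vectors below are toy stand-ins for $g(x_1)$, $g(x_2)$, $g(t_1)$, $g(t_2)$):

```python
import numpy as np

def pair_loss(gx1, gx2, gt1, gt2, delta=1.0):
    """Margin loss for one pair <x1, x2> with negatives t1, t2:
    each hinge fires only if a negative comes within delta of the pair's similarity."""
    sim = gx1 @ gx2
    return (max(0.0, delta - sim + gx1 @ gt1) +
            max(0.0, delta - sim + gx2 @ gt2))

# a pair clearly more similar to each other than to its negatives incurs no loss
x1 = np.array([1.0, 0.0]); x2 = np.array([0.9, 0.1])
t1 = np.array([-1.0, 0.0]); t2 = np.array([0.0, -1.0])
loss = pair_loss(x1, x2, t1, t2)   # both hinge terms are inactive: loss == 0.0
```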

Selecting Negative Examples
To select $t_1$ and $t_2$, the most similar phrase in the mini-batch, excluding the pair itself, is chosen:

$t_1 = \arg\max_{t:\langle t,\cdot\rangle \in X_b \setminus \{\langle x_1,x_2\rangle\}} g(x_1) \cdot g(t)$

where $X_b \subseteq X$ is the current mini-batch.
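The argmax over the mini-batch can be sketched as follows (indices stand in for phrases, and excluding the pair $\langle x_1, x_2\rangle$ is approximated by skipping the pair's own index):

```python
import numpy as np

def pick_negative(gx1, batch_vecs, pair_idx):
    """t1 = argmax over the mini-batch, excluding x1's own pair, of g(x1).g(t)."""
    best, best_sim = None, -np.inf
    for i, gt in enumerate(batch_vecs):
        if i == pair_idx:                 # skip <x1, x2> itself
            continue
        sim = gx1 @ gt
        if sim > best_sim:
            best, best_sim = i, sim
    return best

# toy mini-batch of phrase vectors; index 2 is most similar to x1
batch = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.8, 0.6])]
t1 = pick_negative(np.array([1.0, 0.0]), batch, pair_idx=0)   # -> 2
```

Picking the hardest in-batch negative this way makes the margin constraint informative without searching the full training set.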

Training Word Paraphrase Models
To train just word vectors on word paraphrase pairs:

$$\min_{W_w}\frac{1}{|X|}\Big(\sum_{\langle x_1,x_2\rangle\in X} \max(0,\delta-W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_1)} \cdot W_w^{(t_1)}) \\ + \max(0,\delta-W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_2)} \cdot W_w^{(t_2)})\Big) \\ + \lambda_{W_w}||W_{w_{\text{initial}}}-W_w||^2$$

Experiments: Word Paraphrasing

Training Procedure

They did a coarse grid search over $\lambda_{W_w}$ and the mini-batch size, training for 20 epochs for each set of hyperparameters using AdaGrad.

Tuning and Evaluation

Hyperparameters were selected to maximize $2\times$ the WS-S correlation minus the WS-R correlation. The idea was to reward vectors with high similarity scores and relatively low relatedness scores, in order to target the paraphrase relationship. They chose SL999 as the primary test set because it most closely evaluates the paraphrase relationship. Note that all experiments use cosine similarity as the similarity metric and evaluate the statistical significance of dependent correlations using the one-tailed method of Steiger (1980).
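The selection criterion can be sketched with a small rank-correlation helper (toy scores below; this is a no-ties Spearman via rank transform, not the significance test of Steiger, 1980):

```python
import numpy as np

def spearman(x, y):
    """Spearman rho for score lists without ties: rank-transform, then Pearson."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def selection_score(ws_s_pred, ws_s_gold, ws_r_pred, ws_r_gold):
    """2 * rho(WS-S) - rho(WS-R): reward similarity, penalize mere relatedness."""
    return 2.0 * spearman(ws_s_pred, ws_s_gold) - spearman(ws_r_pred, ws_r_gold)

# toy check: perfectly ranked WS-S, anti-ranked WS-R -> maximal score of 3.0
score = selection_score([0.1, 0.5, 0.9], [1, 2, 3],
                        [0.9, 0.5, 0.1], [1, 2, 3])
```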

One open question here: since the WS-S dataset is used for tuning, is it guaranteed to have no overlap with SL999?

Experiments: Compositional Paraphrasing

They use a support vector regression model ($\epsilon$-SVR) on the 33 features that are included for each phrase pair in PPDB. The parameters are tuned using 5-fold cross-validation on the dev set. After finding the best-performing $C$ and $\epsilon$ combination, the model is retrained on the entire dev set and evaluated on the test set of Annotated-PPDB.
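A sketch of this pipeline with scikit-learn's `SVR` and `GridSearchCV`, using random stand-ins for the 33 PPDB features and gold scores (all data below is synthetic; the grid values are illustrative, not the paper's):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# synthetic stand-ins for the 33 PPDB features and gold paraphrase scores
rng = np.random.default_rng(0)
X_dev = rng.standard_normal((260, 33))
y_dev = 0.5 * X_dev[:, 0] + rng.normal(scale=0.1, size=260)

# 5-fold CV over C and epsilon; GridSearchCV then refits the best
# estimator on the whole dev set (refit=True is the default)
grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1, 0.5]},
                    cv=5)
grid.fit(X_dev, y_dev)
best_model = grid.best_estimator_   # ready to score the held-out test set
```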
